Due on Friday 4/5
Looking ahead to the rest of the semester
By the end of today's lecture, you will learn:
We are in the middle of Unit 4 on data confidentiality. Recall that cryptographic systems attempt to provide (some or all of) the following three objectives.
These three security objectives are sometimes referred to as the "CIA triad."
We can further sub-divide data confidentiality into three different scenarios.
Protecting data in transit, for instance by using end-to-end encrypted messengers like WhatsApp or iMessage. This involves encryption from Alice to Bob, in order to stop anyone else on the Internet from reading their messages. (But note that Alice and Bob implicitly trust each other with the message contents.)
Both of these scenarios can be achieved using authenticated encryption, the tool we designed over the past two weeks.
This may sound counterintuitive to you, or perhaps even impossible. After all, if you want someone to conduct data analysis for you, don't you have to give them your data? That's how modern cloud computing works.
We will spend the next two weeks exploring this problem. My goal is to convince you that it is possible to conduct data science without data sharing.
To solve this problem, we will need a new data confidentiality tool besides encryption. Encryption has an all-or-nothing property: anyone who knows the symmetric key (e.g., Alice or Bob) can read everything, whereas anyone without the key (e.g., Mallory) learns nothing.
Instead, we want a new system where the people you're working with and the people you don't fully trust are the same people. In other words, what happens when Bob is also Mallory?
“Cryptography is how people get things done when they need one another, don’t fully trust one another, and have adversaries actively trying to screw things up.”
– Ben Adida
To begin this journey, let's recall the idea of cryptographic secret sharing that you all saw a few weeks ago in discussion.
Secret sharing enables a group of parties to jointly hold a secret, but in such a way that no individual person or small coalition in the group knows the secret.
Each person holds only a single share of the secret -- where the word "share" here is used in the sense of a stock share, not in the sense of sharing data.
Some people prefer the term secret splitting instead, as it is a more accurate description of what's happening.
For example: think of a box that contains a secret that nobody knows. The box is secured with a lock that can only be opened when a set of keys is used together.
Otherwise, the box remains closed and nobody has access to its contents.
Informally, a $(t, n)$-secret sharing scheme splits the secret $s$ into $n$ shares, such that any $t - 1$ of the shares reveal no information about $s$, while any $t$ shares allow complete reconstruction of the secret $s$.
A formal definition follows.
Definition. A cryptographic secret sharing scheme is defined by two algorithms:
$\text{split}(secret,t,n) \to \vec{s} = [s_1, s_2, \ldots, s_n]$. This is a randomized method that splits a secret into a vector of $n$ shares, such that any subset of shares of size $t$ can reconstruct the original secret.
$\text{reconstruct}(s_{i_1}, \ldots, s_{i_t}) \to s$ where $i_1, i_2, \ldots, i_t \in \{1, 2, \ldots, n\}$. This is typically a deterministic method, and it takes $t$ distinct secret shares (i.e., a sub-vector of $\vec{s}$) and produces the original secret.
The scheme must satisfy two security properties.
Correctness: if you split any secret $s$, then reconstruct always recovers the original secret $s$.

Perfect privacy is defined similarly to the perfect secrecy of a one-time pad: for any two possible secrets $s$ and $s'$, and for any vector of $t-1$ or fewer secret shares $\vec{x} = (x_{i_1}, x_{i_2}, \ldots, x_{i_{t-1}})$, it is the case that:
$$ \Pr[\text{all }x_i = s_i \mid \vec{s} = \text{split}(s, t, n)] = \Pr[\text{all }x_i = s'_i \mid \vec{s} = \text{split}(s', t, n)].$$
In words: perfect privacy says that the shares held by a coalition of $t-1$ parties have an equal probability of being shares of $s$ or shares of $s'$.
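To make this concrete, here is a small sanity check in Python for 2-out-of-2 additive sharing over a toy modulus (the modulus 8 and the secrets 0 and 5 are my own choices, picked so we can enumerate every sharing): the first share on its own is uniformly distributed no matter which secret was split.

```python
from collections import Counter

MOD = 8  # toy modulus, small enough to enumerate every sharing

def all_sharings(secret):
    """Every possible 2-out-of-2 additive sharing of `secret` mod 8."""
    return [(s1, (secret - s1) % MOD) for s1 in range(MOD)]

# For any two secrets s and s', the share s1 alone has the same
# (uniform) distribution -- a coalition of t-1 = 1 party learns nothing.
for secret in (0, 5):
    dist = Counter(s1 for s1, _ in all_sharings(secret))
    assert dist == Counter(range(MOD))
```

Every value of $s_1$ appears exactly once among the 8 possible sharings, regardless of the secret, which is exactly what the perfect privacy equation demands.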
For today, we will focus on an additive secret sharing protocol.
Suppose that Susie has a secret $s$ that she wants to split between two of her friends Alice and Bob. However, Susie does not trust either Alice or Bob enough to hand over her secret.
Instead, she wants each of her friends to hold onto one share of the secret on their local hard drives. This way:
Here is a protocol that allows Susie to accomplish this goal. Suppose that her secret $s$ is $k$ bits in length. Susie can split it into two shares as follows.
To reconstruct the secret later, calculate $s \gets s_1 \oplus s_2$.
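As a sketch, both steps fit in a few lines of Python, with byte strings standing in for $k$-bit strings (the example message is my own):

```python
import secrets

def split_xor(secret: bytes) -> tuple[bytes, bytes]:
    """Share 1 is uniformly random; share 2 is the secret XOR share 1."""
    s1 = secrets.token_bytes(len(secret))
    s2 = bytes(x ^ y for x, y in zip(secret, s1))
    return s1, s2

def reconstruct_xor(s1: bytes, s2: bytes) -> bytes:
    """XOR the two shares back together to recover the secret."""
    return bytes(x ^ y for x, y in zip(s1, s2))

s1, s2 = split_xor(b"attack at dawn")
assert reconstruct_xor(s1, s2) == b"attack at dawn"
```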
I claim that this is a secret sharing scheme among $n = 2$ parties that requires a threshold of $t = 2$ of them to reconstruct.
There is a subtle but essential probability argument needed to make this claim rigorous.
$s_1$ has the uniform distribution over all length-$k$ bitstrings, by construction. So it looks like a OTP key.
$s_2$ on its own also has the uniform distribution over all length-$k$ bitstrings! This is because a uniformly-random bitstring xored with a constant bitstring has a uniform distribution.
For example, suppose that the secret $s = \texttt{111..1}$. Then $s_2 = \texttt{111..1} \oplus s_1$ is the result of taking a uniformly-random bitstring and flipping all of its bits. The result also has the uniform distribution.
We can also construct a secret sharing scheme using addition, rather than the xor operation. Here's an alternative approach for Susie to split her secret:
To reconstruct the secret later, calculate $s \gets s_1 + s_2 \bmod{2^k}$.
Correctness and privacy hold for essentially the same reasons as before. Do you see why?
Also, both the boolean and arithmetic secret sharing approaches immediately generalize to the $n$ party setting. Susie can split her secret $s$ among $n$ parties by sampling shares such that:
$$ s = s_1 + s_2 + \cdots + s_n \bmod{2^k}.$$

Subsequently, all $n$ parties must honestly participate in order to reconstruct the original secret $s$.
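A sketch of this $n$-party arithmetic version in Python, with $k = 64$ as an assumed share length:

```python
import secrets

K = 64
MOD = 1 << K  # shares live in the ring of integers mod 2^64

def split_additive(secret: int, n: int) -> list[int]:
    """Sample n-1 uniform shares; the last share makes the sum come out right."""
    shares = [secrets.randbelow(MOD) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % MOD)
    return shares

def reconstruct_additive(shares: list[int]) -> int:
    """All n parties must contribute: with any share missing, the rest are uniform."""
    return sum(shares) % MOD

assert reconstruct_additive(split_additive(123456, n=5)) == 123456
```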
But hang on a second. I started today by discussing three types of data protection: data at rest, data in transit, and data in use.
So far, we have just discussed secret sharing as a new way for Susie to protect data at rest. Granted, the data is stored at rest with Alice and Bob rather than on Susie's own laptop... but still the data is being protected at rest.
What I would ideally like to do is to perform data analysis on the data while it is inside the box.
[Image source: NASA]
Let me introduce this new way to conduct data science by way of a story -- a true story.
In 2013, then-Boston Mayor Tom Menino commissioned a working group to write a report about how to "make Boston the premier city for working women."
The Mayor's office wrote a report, and they established a new group called the Boston Women's Workforce Council that brought together leading women in the state from government, industry, and academia to find and implement ways to close the gender and racial wage gap.
The Council established a Compact -- or pledge -- that has been signed by more than 250 companies throughout the greater Boston area. The companies promise to do three things; I want to draw your attention to the part about contributing data toward measuring the city-wide gender and racial wage gap.
[Image source: Boston Women's Workforce Council]
The text of the pledge reveals the tension between data confidentiality and data analysis.
Contributing their employee data anonymously. Individual companies don't want to publish their salary data.
Measurement to create a snapshot of the gender/racial wage gap in Greater Boston. Everyone is interested in learning about the overall state of the gender and racial wage gap, as part of their efforts to close the gap and move toward a world of 100% equal pay for equal work.
Question. How can the companies resolve this tension? How can they perform data analysis while preserving data confidentiality?
The BWWC initially tried to find a trusted external party to hold onto the data and compute the city-wide wage gap. However, by 2014 it was clear that this wasn’t going to work, for two reasons.
But this story isn't over. The Boston Women's Workforce Council has run the data analysis to calculate the city-wide gender and racial wage gap. In fact they've done it 5 times now: in 2016, 2017, 2019, 2021, and 2023.
So how did they do it? The answer lies in a new cryptographic primitive that we will study for the next week: secure multi-party computation.
Several of us at Boston University have worked with the Boston Women's Workforce Council and the city to design, build, and deploy a web-based survey form. This webpage allows companies to contribute their data toward the wage gap calculation without anyone learning their salary data -- not even us!
Here's what our website looks like. The intention is for someone in (say) the HR department of each participating company to input the number and total annual compensation of all employees in each group of gender, race/ethnicity, and job category.
The form is designed to follow the format of a U.S. Equal Employment Opportunity Commission form that all large employers must fill out every year. So it is a format that our target audience is already familiar with.
So far, this looks like an ordinary website. The only special part about it is what happens after you click on the "submit" button.
With a normal website, clicking "submit" would cause the data to be sent over to the server. It would likely be encrypted in transit, but it would be viewable by the people running the server (in this case, us at BU).
But this website is different. When you click on the "submit" button, the data itself doesn't go anywhere!
Instead, we use cryptographic secret sharing to split the data between BU and the Boston Women's Workforce Council.
The objective is to learn a single spreadsheet containing the sum across all spreadsheets provided by the hundreds of companies. That is: we want to know the city-wide total for each cell in the spreadsheet, without learning any information about individual employees or employers.
Let's see how this works using a smaller-scale version of the wage gap project. For now, let's suppose that:
Let’s call the employers’ individual answers $a$, $b$, and $c$, respectively. Our objective is for:
(Note: technically everyone is interested in the average, not the sum. But since the number of companies is known, it's straightforward to convert from one to the other: just divide by 3.)
| $a$ | $b$ | $c$ |
|---|---|---|
First, each company visits the website and splits their secret data between BU and the Council. Remember that these shares (e.g., $a_1$ and $a_2$) have a uniform distribution subject to the constraint that they sum to the real secret (e.g., $a = a_1 + a_2 \bmod{2^k}$). So, neither BU nor the Council has learned anything on its own so far, because each holds only a random number.
| $a$ | $b$ | $c$ |
|---|---|---|
| $a_1$ | $b_1$ | $c_1$ |
| $a_2$ | $b_2$ | $c_2$ |
Our overall goal is for the City of Boston to be able to compute $a + b + c$, which is the sum of all six secret shares.
Here comes the key insight: because addition is commutative, it doesn't matter whether we first add the numbers vertically and then horizontally, or in the other direction. And the other direction preserves privacy!
Second, BU and the Council add up the three secret shares that they possess, and send these results over to the City.
Each of these numbers individually is uniformly random, and it doesn’t reveal anything about $a$, $b$, or $c$ individually, so it preserves the companies' privacy.
| $a$ | $b$ | $c$ |  |
|---|---|---|---|
| $a_1$ | $b_1$ | $c_1$ | $a_1 + b_1 + c_1$ |
| $a_2$ | $b_2$ | $c_2$ | $a_2 + b_2 + c_2$ |
Finally, the City of Boston can add the numbers on the right to reconstruct the sum $a + b + c$, which is the desired city-wide aggregate statistic.
This is what our web-based software does behind the scenes. It just does it for each of the 100+ participating companies, and for each of the 600+ cells in the spreadsheet.
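For one spreadsheet cell, the whole protocol amounts to a few lines. Here's a small Python simulation (the three salary totals are invented example numbers, not real data):

```python
import secrets

MOD = 1 << 64

def split2(secret: int) -> tuple[int, int]:
    """2-out-of-2 additive sharing mod 2^64."""
    s1 = secrets.randbelow(MOD)
    return s1, (secret - s1) % MOD

# Invented example values for one spreadsheet cell at three companies.
a, b, c = 70_000, 85_000, 62_000

# Step 1: each company splits its value between BU and the Council.
bu_shares, council_shares = zip(*(split2(x) for x in (a, b, c)))

# Step 2: each server sums the column of shares it holds.
# Each column sum is itself uniformly random, so it reveals nothing.
bu_sum = sum(bu_shares) % MOD
council_sum = sum(council_shares) % MOD

# Step 3: the City adds the two column sums to get the aggregate.
city_total = (bu_sum + council_sum) % MOD
assert city_total == a + b + c
```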
| $a$ | $b$ | $c$ | $a + b + c$ |
|---|---|---|---|
| $a_1$ | $b_1$ | $c_1$ | $a_1 + b_1 + c_1$ |
| $a_2$ | $b_2$ | $c_2$ | $a_2 + b_2 + c_2$ |
Great! We now know how to perform data analysis over data that we cannot see ... well, as long as the only data analysis we want to perform is addition.
The good news is that you can perform cryptographically secure multi-party computation of any function!
Concretely, consider the following setting:
Secure multi-party computation (often abbreviated as MPC) allows the parties to:
Now that we understand how MPC works informally, this is usually where I would provide a scientific definition. I'm not going to do that this time though, because the formal definition of MPC is somewhat complicated. You'll read about it in this week's required reading.
Instead, I reproduce below some text from a recent bill in the U.S. Congress that provides a legal definition of MPC. (Emphasis added by me.)
The term "secure multi-party computation" means a computerized system that enables different participating entities in possession of private sets of data to link and aggregate their data sets for the exclusive purpose of performing a finite number of pre-approved computations without transferring or otherwise revealing any private data to each other or anyone else. – 2022 U.S. Senate bill S.3952
We can consider MPC against two kinds of adversaries (the same ones we considered for encryption):
The protocol I showed you today only protects against a passive eavesdropper Eve.
If, for example, we at BU tampered with the data, then we could cause the output of the calculation to be incorrect. That is: we can break data integrity. Even so, there's no way for us to break data confidentiality.
| $a$ | $b$ | $c$ | $a + b + c \color{red}{+ 10}$ |
|---|---|---|---|
| $a_1$ | $b_1$ | $c_1$ | $a_1 + b_1 + c_1 \color{red}{+ 10}$ |
| $a_2$ | $b_2$ | $c_2$ | $a_2 + b_2 + c_2$ |
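A quick Python simulation of this attack (with toy inputs of my own choosing) shows the asymmetry: tampering shifts the reconstructed total, but never exposes a secret.

```python
import secrets

MOD = 1 << 64

def split2(secret: int) -> tuple[int, int]:
    s1 = secrets.randbelow(MOD)
    return s1, (secret - s1) % MOD

a, b, c = 5, 7, 9  # toy company inputs
bu_shares, council_shares = zip(*(split2(x) for x in (a, b, c)))

# A malicious BU adds 10 to its column sum before sending it to the City.
tampered_bu_sum = (sum(bu_shares) + 10) % MOD
result = (tampered_bu_sum + sum(council_shares)) % MOD

# Integrity breaks: the City reconstructs the wrong total...
assert result == a + b + c + 10
# ...but confidentiality holds: BU only ever saw uniformly random shares.
```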
For today, let's continue looking at MPC against a passive eavesdropper Eve. We'll look next time at how to protect integrity against an active adversary Mallory.
Secure multi-party computation protocols have three steps:
Steps 1 and 3 come directly from the split and reconstruct operations of the secret sharing construction.
The main challenge is in step 2. So far, we only know how to add secrets by adding their corresponding secret shares.
The ideas that we've already shown for secure addition easily extend to cover all linear functions!
For instance, suppose that rather than calculating $a + b + c$, we actually wanted to compute $2a + 3b - c + 10$.
Since we are using additive secret sharing, and scalar multiplication distributes over addition, each party can just compute its share of the result.
| $a$ | $b$ | $c$ | $2a + 3b - c + 10$ |
|---|---|---|---|
| $a_1$ | $b_1$ | $c_1$ | $2a_1 + 3b_1 - c_1 + 10$ |
| $a_2$ | $b_2$ | $c_2$ | $2a_2 + 3b_2 - c_2 + 0$ |
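Sketched in Python with made-up inputs: each server evaluates the linear function locally on its own shares, and exactly one server adds the public constant.

```python
import secrets

MOD = 1 << 64

def split2(secret: int) -> tuple[int, int]:
    s1 = secrets.randbelow(MOD)
    return s1, (secret - s1) % MOD

a, b, c = 11, 4, 6  # made-up secret inputs
(a1, a2), (b1, b2), (c1, c2) = split2(a), split2(b), split2(c)

# Local evaluation: the scalars distribute over the shares, and the
# public constant 10 is added by exactly one of the two servers.
share1 = (2 * a1 + 3 * b1 - c1 + 10) % MOD
share2 = (2 * a2 + 3 * b2 - c2) % MOD

assert (share1 + share2) % MOD == 2 * a + 3 * b - c + 10
```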
Next, let's try to multiply two secrets. Suppose that two data holders input $a$ and $b$, respectively. Can we somehow calculate secret shares of the product $c = a \cdot b$?
| $a$ | $b$ | $c = a \cdot b$ |
|---|---|---|
| $a_1$ | $b_1$ | $\color{red}{??}$ |
| $a_2$ | $b_2$ | $\color{red}{??}$ |
This is not as easy to compute. This time we cannot rely solely on commutativity of addition; we need to consider the distributive law as well.
Recall the relationships between the secret variables and their additive shares (all equations will be mod $2^k$, but I'm going to omit writing the modulus for brevity): $$\begin{align*} a &= a_1 + a_2 \\ b &= b_1 + b_2. \end{align*}$$
Hence, we want to compute: $$\begin{align*} c &= a \cdot b \\ &= (a_1 + a_2) (b_1 + b_2) \\ &= a_1 b_1 + a_1 b_2 + a_2 b_1 + a_2 b_2. \end{align*}$$
BU knows how to compute the term $a_1 b_1$ on its own, and the Council knows how to compute $a_2 b_2$ on its own. But it's unclear how to compute the cross-terms $a_1 b_2$ and $a_2 b_1$ -- that is, it's unclear how to calculate these terms without the computing parties sending their shares to each other, but that would break data confidentiality.
I'll show you today one way to resolve this problem. It relies on two useful insights.
Insight #1: let's add a third computing server.
| $a$ | $b$ | $c = a \cdot b$ |
|---|---|---|
| $a_1$ | $b_1$ | $\color{red}{??}$ |
| $a_2$ | $b_2$ | $\color{red}{??}$ |
| $a_3$ | $b_3$ | $\color{red}{??}$ |
At first glance, this might appear to make the calculation even harder to perform. Now the distributive law says that we must calculate:
$$\begin{align*} c &= (a_1 + a_2 + a_3) (b_1 + b_2 + b_3) \\ &= a_1 b_1 + a_1 b_2 + a_1 b_3 \\ &\quad + a_2 b_1 + a_2 b_2 + a_2 b_3 \\ &\quad + a_3 b_1 + a_3 b_2 + a_3 b_3. \end{align*}$$

Insight #2: let's give each computing party two of the three shares. Note that this is still not enough for each individual computing party to reconstruct any secrets, but now a threshold of $t = 2$ of the 3 parties will collectively have enough information to reconstruct secrets.
The punchline: for each of the 9 terms in this product, there now exists a computing server that can calculate it! So the three computing servers can calculate $a \cdot b$ by working together, in such a way that nobody has learned either $a$ or $b$.
$$\begin{align*} c &= (a_1 + a_2 + a_3) (b_1 + b_2 + b_3) \\ &= \color{red}{a_1 b_1} + \color{red}{a_1 b_2} + \color{gray}{a_1 b_3} \\ &\quad + \color{red}{a_2 b_1} + \color{blue}{a_2 b_2} + \color{blue}{a_2 b_3} \\ &\quad + \color{gray}{a_3 b_1} + \color{blue}{a_3 b_2} + \color{gray}{a_3 b_3}. \end{align*}$$

| $a$ | $b$ | $c = a \cdot b$ |
|---|---|---|
| $a_1, a_2$ | $b_1, b_2$ | $c_1 = a_1b_1 + a_1b_2 + a_2b_1$ |
| $a_2, a_3$ | $b_2, b_3$ | $c_2 = a_2b_2 + a_2b_3 + a_3b_2$ |
| $a_3, a_1$ | $b_3, b_1$ | $c_3 = a_3b_3 + a_3b_1 + a_1b_3$ |
There's just one downside here: the computing parties have two shares of $a$ and $b$, but only one share of $c$. So we cannot continue and multiply $c$ times some other secret.
This is easy to fix with some communication: each computing party gives its share to one other party.
Now our invariant is maintained: each computing party starts with 2 out of 3 shares of each input, and completes multiplication with 2 out of 3 shares of each output.
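Putting the multiplication step together as a Python sketch, using the share assignment from the table above (the modulus $2^{64}$ is an assumed parameter):

```python
import secrets

MOD = 1 << 64

def split3(secret: int) -> list[int]:
    """Three additive shares; party i will hold shares i and i+1 (mod 3)."""
    shares = [secrets.randbelow(MOD) for _ in range(2)]
    shares.append((secret - sum(shares)) % MOD)
    return shares

def multiply(a: list[int], b: list[int]) -> list[int]:
    """Party i, holding (a_i, a_{i+1}) and (b_i, b_{i+1}), locally computes
    one additive share of the product; the nine cross terms are covered
    exactly once across the three parties."""
    c = []
    for i in range(3):
        j = (i + 1) % 3
        c.append((a[i] * b[i] + a[i] * b[j] + a[j] * b[i]) % MOD)
    return c

a, b = 1234, 5678
c_shares = multiply(split3(a), split3(b))
assert sum(c_shares) % MOD == a * b
```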
To summarize, we have constructed a secure multiplication scheme with:
So far, we have seen one way to perform secure multi-party computation of $+$ and $\times$.
Importantly: $+$ and $\times$ form a complete set of gates! That is, we can write any data analysis as a circuit with $+$ and $\times$ gates. (This may not be the fastest way to compute a given function, but it will always work in principle.)
For instance, given the circuit below along with secret shares of $s$, $t$, and $x$ it is possible for the 3 computing parties to work together to calculate:
and then reconstruct only the output $y$.
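Since the circuit itself appears only in the figure, here's a sketch that assumes a small example circuit, $y = (s + t) \times x$, and evaluates it gate by gate over the shares:

```python
import secrets

MOD = 1 << 64

def split3(secret: int) -> list[int]:
    shares = [secrets.randbelow(MOD) for _ in range(2)]
    shares.append((secret - sum(shares)) % MOD)
    return shares

def add_gate(a: list[int], b: list[int]) -> list[int]:
    # Addition is purely local: add the shares componentwise.
    return [(x + y) % MOD for x, y in zip(a, b)]

def mul_gate(a: list[int], b: list[int]) -> list[int]:
    # Party i (holding shares i and i+1) computes its three product terms.
    return [(a[i] * b[i] + a[i] * b[(i + 1) % 3] + a[(i + 1) % 3] * b[i]) % MOD
            for i in range(3)]

s, t, x = 3, 4, 5  # example secret inputs
ys = mul_gate(add_gate(split3(s), split3(t)), split3(x))

# Only the final output y is reconstructed; intermediate wires stay shared.
assert sum(ys) % MOD == (s + t) * x
```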
Hence, the protocol we have constructed today can do a secure computation of any function, as long as:
It turns out that you can also perform MPC in the dishonest majority setting, and against an actively-malicious adversary Mallory. We'll see how to do that in future lectures.